Honaker, King, and Blackwell (2011)
Blackwell, Honaker, and King (2017)
https://gking.harvard.edu/category/research-interests/methods/missing-data
Missing data poses a problem: missingness biases inference. However, missingness need not be intractable. When the missing data are random conditional on known features of the data,1 we can use regression to predict the missing observations. Because regression predictions are uncertain, we must generate multiple datasets with imputed observations. This process of generating multiple datasets is called multiple imputation. Before proceeding with an explanation of the mechanics of multiple imputation, we take a moment to recollect two fundamental principles of data science that we have been discussing throughout this course:
The practice of statistical inference is educated guesswork. We never observe population parameters; we only observe samples. We assume that our samples are drawn randomly from an unobserved population. When we move from measured features of our sample to unmeasured features of a population, we are projecting with uncertainty. Multiply imputing missing data is no different: multiple imputation is continuous with a data scientist’s larger mission of guessing wisely.
Regression is a tool for prediction, and regression is also a tool for causal inference. How we go about using regression differs depending on whether we want to predict or whether we want to explain. When we regress to predict, it is generally useful to include all relevant indicators in our dataset that might be related to the outcome and that are not collinear with other indicators in our dataset.2
Prediction: we have discussed strategies for assessing improvements in predictive model fit using criteria such as the BIC, the WAIC, and LOO.
Gelman and Hill offer the following advice about inclusion of variables when using regression to predict (Gelman and Hill 2006, 69):
Our general principles for building regression models for prediction are as follows:
- Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome.
- It is not always necessary to include these inputs as separate predictors—for example, sometimes several inputs can be averaged or summed to create a “total score” that can be used as a single predictor in the model.
- For inputs that have large effects, consider including their interactions as well.
- We suggest the following strategy for decisions regarding whether to exclude a variable from a prediction model based on expected sign and statistical significance (typically measured at the 5% level; that is, a coefficient is “statistically significant” if its estimate is more than 2 standard errors from zero):
- If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in. It may not help predictions dramatically but is also probably not hurting them.
- If a predictor is not statistically significant and does not have the expected sign (for example, incumbency having a negative effect on vote share), consider removing it from the model (that is, setting its coefficient to zero).
- If a predictor is statistically significant and does not have the expected sign, then think hard if it makes sense. (For example, perhaps this is a country such as India in which incumbents are generally unpopular; see Linden, 2006.) Try to gather data on potential lurking variables and include them in the analysis.
- If a predictor is statistically significant and has the expected sign, then by all means keep it in the model. These strategies do not completely solve our problems but they help keep us from making mistakes such as discarding important information. They are predicated on having thought hard about these relationships before fitting the model. It’s always easier to justify a coefficient’s sign after the fact than to think hard ahead of time about what we expect. On the other hand, an explanation that is determined after running the model can still be valid. We should be able to adjust our theories in light of new information (Gelman and Hill 2006, 69).
Causal inference: do not include all predictors that might be associated with the outcome. Doing so is an invitation to confounded causal inference. To obtain an unbiased estimate of an exposure’s causal effect on the outcome(s) of interest, we must close all backdoor paths from the exposure to the outcome, and we must avoid conditioning on any colliders. As Richard McElreath writes [@mcelreath2020a, p.46]:
But the approach which dominates in many parts of biology and the social sciences is instead CAUSAL SALAD.36 Causal salad means tossing various “control” variables into a statistical model, observing changes in estimates, and then telling a story about causation. Causal salad seems founded on the notion that only omitted variables can mislead us about causation. But included variables can just as easily confound us. When tossing a causal salad, a model that makes good predictions may still mislead about causation. If we use the model to plan an intervention, it will get everything wrong. [@mcelreath2020a, p.46]
To assist us with the problem of unbiased causal inference, we have discussed the utility of building causal DAGs, noting that inference is always conditional on the DAG that we have drawn. Frequently, the data under-determine the DAG: there are many DAGs consistent with our observations. Regression is a powerful tool for causal inference, but on its own it is not sufficient for reliable inference.
So which approach to regression should we use for multiple imputation? The answer is that we should use the predictive approach. When we are imputing, we are predicting features of a population. We should therefore follow Gelman and Hill’s advice and stratify across all indicators that might improve predictive accuracy.
When you are examining your data, make sure to assess missing responses. A handy tool for assessing missingness is a missingness graph.
Let’s consider a subsample of the nz-jitter longitudinal dataset whose participants responded to multiple waves. Today we will ask whether employment insecurity is causally related to charitable donations. To predict missingness we should probably include other indicators besides those included here, but these indicators will suffice for our purposes.
df <- nz12 %>%
  select(
    Id,
    CharityDonate,
    Emp.JobSecure,
    Household.INC,
    Hours.Work,
    Male,
    Employed,
    Pol.Orient,
    Relid,
    Wave,
    yearS,
    KESSLER6sum,
    Partner,
    Age
  ) %>%
  dplyr::mutate(Partner = factor(Partner))
# always inspect your dataframe
glimpse(df)
Rows: 4,140
Columns: 14
$ Id <fct> 15, 15, 15, 15, 15, 15, 15, 15, 15, …
$ CharityDonate <dbl> 20, 0, 5, 10, 70, 0, 170, 160, 80, 1…
$ Emp.JobSecure <dbl> 4, 6, 6, NA, 7, 5, NA, 7, NA, 7, NA,…
$ Household.INC <dbl> 80000, 70000, 56000, 65000, 85000, 8…
$ Hours.Work <dbl> NA, NA, 0, 0, 15, 25, 0, 18, 0, 14, …
$ Male <fct> Male, Male, Male, Male, Male, Male, …
$ Employed <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, …
$ Pol.Orient <dbl> 3, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 4, …
$ Relid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, …
$ Wave <fct> 2010, 2011, 2012, 2013, 2014, 2015, …
$ yearS <dbl> 27, 347, 834, 1200, 1608, 2037, 2336…
$ KESSLER6sum <int> 4, 4, 4, 4, 3, 4, 5, 4, 3, 5, 1, 2, …
$ Partner <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Age <dbl> 38.82820, 39.70431, 41.03765, 42.036…
How many individuals?
nz12 %>%
group_by(Wave) %>%
summarise(Unique_Id = n_distinct(Id))
# A tibble: 10 x 2
Wave Unique_Id
<fct> <int>
1 2010 414
2 2011 414
3 2012 414
4 2013 414
5 2014 414
6 2015 414
7 2016 414
8 2017 414
9 2018 414
10 2019 414
We can visualise the missing data using the naniar package:
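A minimal sketch (assuming the naniar package is installed; `vis_miss()` and `gg_miss_var()` are its standard missingness-overview plots):

```r
library(naniar)
# heatmap of missing vs observed cells across the whole dataframe
naniar::vis_miss(df)
# count of missing values for each variable, ordered
naniar::gg_miss_var(df)
```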
We can see substantial missingness for Emp.JobSecure.
Let’s explore this:
df %>%
  select(Wave, Emp.JobSecure) %>%
  group_by(Wave) %>%
  tally(is.na(Emp.JobSecure))
# A tibble: 10 x 2
Wave n
<fct> <int>
1 2010 89
2 2011 106
3 2012 118
4 2013 117
5 2014 129
6 2015 133
7 2016 414
8 2017 167
9 2018 172
10 2019 183
Lots of missingness in Emp.JobSecure, and note that the question was not included in 2016.
table1::table1(~ Wave|Emp.JobSecure, data = df, overall = FALSE)
(Columns are the levels of Emp.JobSecure.)

| | 1 (N=90) | 2 (N=96) | 3 (N=139) | 4 (N=268) | 5 (N=403) | 6 (N=731) | 7 (N=785) |
|---|---|---|---|---|---|---|---|
| Wave | |||||||
| 2010 | 12 (13.3%) | 12 (12.5%) | 18 (12.9%) | 31 (11.6%) | 68 (16.9%) | 90 (12.3%) | 94 (12.0%) |
| 2011 | 16 (17.8%) | 14 (14.6%) | 15 (10.8%) | 45 (16.8%) | 43 (10.7%) | 84 (11.5%) | 91 (11.6%) |
| 2012 | 11 (12.2%) | 10 (10.4%) | 17 (12.2%) | 41 (15.3%) | 43 (10.7%) | 86 (11.8%) | 88 (11.2%) |
| 2013 | 11 (12.2%) | 11 (11.5%) | 17 (12.2%) | 35 (13.1%) | 44 (10.9%) | 90 (12.3%) | 89 (11.3%) |
| 2014 | 10 (11.1%) | 10 (10.4%) | 16 (11.5%) | 28 (10.4%) | 40 (9.9%) | 91 (12.4%) | 90 (11.5%) |
| 2015 | 5 (5.6%) | 10 (10.4%) | 18 (12.9%) | 27 (10.1%) | 46 (11.4%) | 78 (10.7%) | 97 (12.4%) |
| 2016 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| 2017 | 7 (7.8%) | 9 (9.4%) | 12 (8.6%) | 21 (7.8%) | 46 (11.4%) | 71 (9.7%) | 81 (10.3%) |
| 2018 | 8 (8.9%) | 12 (12.5%) | 13 (9.4%) | 23 (8.6%) | 35 (8.7%) | 73 (10.0%) | 78 (9.9%) |
| 2019 | 10 (11.1%) | 8 (8.3%) | 13 (9.4%) | 17 (6.3%) | 38 (9.4%) | 68 (9.3%) | 77 (9.8%) |
There are various methods for multiple imputation. First, let’s look at the Amelia package.
library(Amelia)
# set seed
set.seed(1234)
# we need to pass a dataframe to Amelia
prep <- as.data.frame(df) # tibble won't run in amelia !!
# this is the key code
prep2 <- Amelia::amelia(
  prep,                     # dataframe to impute
  m = 10,                   # number of imputations
  cs = c("Id"),             # the cross-sectional variable
  ts = c("yearS"),          # the time-series variable, allowing polynomials
  # ords = none in this dataset, but use this argument for ordinal data
  # logs = for large positive numbers, better to use the natural log
  sqrt = c("KESSLER6sum", "CharityDonate"), # skewed positive data such as K6
  noms = c("Male",          # nominal vars
           "Employed",
           "Partner"),
  idvars = c("Wave"),       # identification variables, left out of the imputation model
  polytime = 3              # cubic time polynomial
) # https://stackoverflow.com/questions/56218702/missing-data-warning-r
-- Imputation 1 -- through -- Imputation 10 --
(Amelia prints the EM iteration count for each imputation; here the chains converged in 19–105 iterations.)
# Impute again but do not impute the outcome
# this is the key code
prep2.1 <- Amelia::amelia(
  prep,                     # dataframe to impute
  m = 10,                   # number of imputations
  cs = c("Id"),             # the cross-sectional variable
  ts = c("yearS"),          # the time-series variable, allowing polynomials
  # ords = none in this dataset, but use this argument for ordinal data
  # logs = for large positive numbers, better to use the natural log
  sqrt = c("KESSLER6sum"),  # skewed positive data such as K6
  noms = c("Male",          # nominal vars
           "Employed",
           "Partner"),
  idvars = c("Wave", "CharityDonate"), # this time we do not impute the outcome
  polytime = 3              # cubic time polynomial
) # https://stackoverflow.com/questions/56218702/missing-data-warning-r
-- Imputation 1 -- through -- Imputation 10 --
(Amelia prints the EM iteration count for each imputation; here the chains converged in 22–95 iterations.)
We can center and scale our variables using the following code:
# prepare data with imputed outcome
prep3 <- Amelia::transform.amelia(
prep2,
Id = as.factor(Id),
# redundant
Age.10yrs = (Age / 10),
years_s = scale(yearS, center = TRUE, scale = TRUE),
years = yearS,
KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale =TRUE),
Employed = factor(Employed),
Relid = scale(Relid, scale = TRUE, center = TRUE),
Male = as.factor(Male),
Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE),
CharityDonate = as.integer(CharityDonate)
)
# center age (in decades)
out <- Amelia::transform.amelia(prep3, Age_in_Decades_C = scale(Age.10yrs,scale =FALSE, center=TRUE))
prep3.1 <- Amelia::transform.amelia(
prep2.1,
Id = as.factor(Id),
# redundant
Age.10yrs = (Age / 10),
years_s = scale(yearS, center = TRUE, scale = TRUE),
years = yearS,
KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale =TRUE),
Employed = factor(Employed),
Relid = scale(Relid, scale = TRUE, center = TRUE),
Male = as.factor(Male),
Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE),
CharityDonate = as.integer(CharityDonate)
)
# center age (in decades)
out.1 <- Amelia::transform.amelia(prep3.1, Age_in_Decades_C = scale(Age.10yrs,scale =FALSE, center=TRUE))
We can plot the imputations to examine what Amelia imputed and with how much uncertainty:
Here are the imputations for Job Security (a random selection):
Amelia::tscsPlot(
prep2,
cs = c("15", "19", "20", "39", "549", "861","1037","1078","1253"),
main = "Imputation of Job security",
var = "Emp.JobSecure",
ylim = c(0, 30)
)
Here are the imputations for Charity (another random selection). Note that there is a fair amount of uncertainty here:
Amelia::tscsPlot(
prep2,
cs = c("394", "1039", "1082", "1149", "1238" , "1253", "1229","15", "19", "20","1037","1078"),
  main = "Imputation of Charity",
var = "CharityDonate",
ylim = c(0, 10000)
)
Which variables do we need to include to estimate the causal effect of job security on charity?
Write your DAG!
library(ggdag)
dg <-
dagify(
charity ~ jobsecure + employed + age + male + relid + years + polconserve,
jobsecure ~ employed + distress + male + years + age,
distress ~ male + employed + years + age + partner,
relid ~ male + years,
age ~ years,
labels = c(
"charity" = "charity",
"jobsecure" = "job security",
"employed" = "employed",
"polconserve" = "political conservative",
"partner" = "partner",
"age" = "age",
"male" = "male",
"relid" = "religious identity",
"years" = "years"
),
exposure = "jobsecure",
outcome = "charity"
) %>%
tidy_dagitty(layout = "nicely")
ggdag::ggdag_adjustment_set(dg)
To obtain an unbiased estimate of the effect of job security on charity we must condition on employed, male, age, and years.
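We can also query the minimal adjustment set programmatically with the dagitty package (a sketch; `dagify()` returns a dagitty-compatible object, so the DAG can be passed to `adjustmentSets()` before tidying):

```r
library(dagitty)

# rebuild the DAG (same structure as above) without tidy_dagitty(),
# so it can be passed directly to dagitty::adjustmentSets()
g <- ggdag::dagify(
  charity ~ jobsecure + employed + age + male + relid + years + polconserve,
  jobsecure ~ employed + distress + male + years + age,
  distress ~ male + employed + years + age + partner,
  relid ~ male + years,
  age ~ years,
  exposure = "jobsecure",
  outcome = "charity"
)
dagitty::adjustmentSets(g)
# per the DAG above, the minimal set is { age, employed, male, years }
```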
We can write the model using the lme4 package, which is fast. I wrote a little function to iterate the model over the imputed datasets; recall that we have 10 of them:
# first write out the model equation
library(lme4)
mod_eq <- 'CharityDonate ~ Emp.JobSecure_S + Employed + Age_in_Decades_C + Male + years_s + (1|Id)'
# run models iterating over imputed data
loop_glmer_model <-
  function(x, y) {
    # x is the model equation, y is the imputed data object
    m <- length(y$imputations)
    mod <- vector("list", m)
    for (i in 1:m) {
      mod[[i]] <- lme4::glmer(x, data = y$imputations[[i]], family = "poisson")
    }
    return(mod)
  }
m_list <- loop_glmer_model(mod_eq, out)
# try without the outcome imputed
m_list1 <- loop_glmer_model(mod_eq, out.1)
Here is a function for obtaining the results:
# table of effects
loop_lmer_model_tab <- function(x) {
  mp <- lapply(x, parameters::model_parameters)
  out <- parameters::pool_parameters(mp)
  return(out)
}
# create table
tab_impute <- loop_lmer_model_tab(m_list)
tab_impute
# Fixed Effects
Parameter | Coefficient | SE | 95% CI | Statistic | p
----------------------------------------------------------------------------
(Intercept) | 6.34 | 0.59 | [ 5.18, 7.50] | 10.72 | < .001
Emp.JobSecure_S | 0.01 | 0.03 | [-0.05, 0.07] | 0.41 | 0.678
Employed [1] | 0.15 | 0.02 | [ 0.10, 0.20] | 5.95 | < .001
Age_in_Decades_C | -4.79 | 1.29 | [-7.31, -2.26] | -3.72 | < .001
Male [Not_Male] | -1.48 | 0.80 | [-3.04, 0.08] | -1.86 | 0.063
years_s | 1.38 | 0.38 | [ 0.63, 2.12] | 3.63 | < .001
SD (Intercept) | 6.79 | | | |
SD (Observations) | 1.00 | | | |
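Under the hood, `pool_parameters()` combines the per-imputation results with Rubin's rules: the pooled estimate is the mean of the m estimates, and the pooled variance adds the average within-imputation variance to the between-imputation variance (inflated by 1 + 1/m). A minimal sketch with made-up numbers (the estimates and standard errors below are hypothetical, not from the model above):

```r
# hypothetical coefficient estimates and SEs from m = 5 imputed datasets
est <- c(0.012, 0.015, 0.009, 0.011, 0.014)
se  <- c(0.030, 0.031, 0.029, 0.030, 0.032)
m   <- length(est)

q_bar <- mean(est)               # pooled point estimate
u_bar <- mean(se^2)              # within-imputation variance
b     <- var(est)                # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b # total variance (Rubin's rules)
pooled_se <- sqrt(t_var)

c(estimate = q_bar, se = pooled_se)
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error: the between-imputation spread adds honest uncertainty.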
# create graph
plot_impute <- plot(tab_impute)
plot_impute
We can plot the predictions:
library(ggeffects)
library(gghighlight) # not used here, useful for interactions
graph_predictions_imputed <- function(x, y) {
  # x = list of model objects, y = plot title
  m <- length(x)
  out <- NULL
  for (i in 1:m) {
    out[[i]] <-
      ggeffects::ggpredict(x[[i]], terms = c("Emp.JobSecure_S"))
  }
  plots <- NULL
  for (i in 1:m) {
    plots[[i]] <-
      plot(out[[i]], facets = TRUE) # + scale_y_continuous(limits = c(6.35, 6.85))
  }
  # show predictions from the final imputed dataset
  plots[[m]] +
    gghighlight::gghighlight() +
    ggtitle(y)
}
# graph
graph_predictions_imputed(m_list, "Prediction of job security on charity (not reliable)")
What if we do not impute the outcome? Does that make a difference? Not much:
# create table
tab_impute1 <- loop_lmer_model_tab(m_list1)
tab_impute1
# Fixed Effects
Parameter | Coefficient | SE | 95% CI | Statistic | p
----------------------------------------------------------------------------
(Intercept) | 6.04 | 0.47 | [ 5.12, 6.97] | 12.84 | < .001
Emp.JobSecure_S | 0.02 | 0.04 | [-0.06, 0.10] | 0.48 | 0.629
Employed [1] | 0.11 | 0.03 | [ 0.06, 0.17] | 3.86 | < .001
Age_in_Decades_C | -4.58 | 0.38 | [-5.34, -3.83] | -11.92 | < .001
Male [Not_Male] | -1.50 | 0.59 | [-2.65, -0.35] | -2.55 | 0.011
years_s | 1.29 | 0.11 | [ 1.08, 1.50] | 12.10 | < .001
SD (Intercept) | 6.65 | | | |
SD (Observations) | 1.00 | | | |
# create graph
plot_impute1 <- plot(tab_impute1)
library(patchwork)
plot_impute / plot_impute1 + plot_annotation()
If you want a LaTeX table, you can use this code:
library(huxtable)
huxtable::as_hux( your_model_here ) %>%
select("Parameter", "Coefficient", "CI_low", "CI_high", "p") %>%
set_number_format(3) %>%
set_left_padding(20) %>%
set_bold(1, everywhere) %>%
quick_latex()
When you run a regression with missing data, R automatically deletes the rows with missing values (listwise deletion).
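Before comparing results, it can help to see how much data listwise deletion discards; a quick sketch (note that `glmer()` only drops rows that are missing on variables appearing in the model formula; `model_vars` below names the raw variables behind our model terms and is an assumption for illustration):

```r
# rows with at least one missing value anywhere in df
sum(!complete.cases(df))

# proportion of rows that would survive deletion on the model variables only
model_vars <- c("CharityDonate", "Emp.JobSecure", "Employed", "Age", "Male", "yearS", "Id")
mean(complete.cases(df[, model_vars]))
```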
Let’s look at the results from the row-wise deleted data:
# prepare data as we did for the imputed dataset
df2 <- df %>%
dplyr::mutate(
Age.10yrs = (Age / 10),
Age_in_Decades_C = scale(Age.10yrs, scale = FALSE, center = TRUE),
years_s = scale(yearS, center = TRUE, scale = TRUE),
years = yearS,
KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale = TRUE),
Employed = factor(Employed),
Relid = scale(Relid, scale = TRUE, center = TRUE),
Male = as.factor(Male),
Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE)
)
# run the row-wise deletion model
m_no_impute <- glmer(mod_eq, data = df2, family = "poisson")
# create table
tab_no <-
parameters::model_parameters(m_no_impute, effects = c("all"))
tab_no
# Fixed Effects
Parameter | Log-Mean | SE | 95% CI | z | p
--------------------------------------------------------------------------
(Intercept) | 5.26 | 0.08 | [ 5.09, 5.42] | 62.00 | < .001
Emp.JobSecure_S | -5.65e-03 | 5.87e-04 | [-0.01, 0.00] | -9.61 | < .001
Employed [1] | 0.19 | 6.46e-03 | [ 0.18, 0.20] | 29.18 | < .001
Age_in_Decades_C | -0.89 | 0.04 | [-0.96, -0.81] | -21.86 | < .001
Male [Not_Male] | -0.64 | 0.11 | [-0.86, -0.43] | -5.79 | < .001
years_s | 0.18 | 0.01 | [ 0.16, 0.21] | 15.41 | < .001
# Random Effects
Parameter | Coefficient
--------------------------------
SD (Intercept: Id) | 1.00
SD (Residual) | 1.00
# create graph
plot_no <- plot(tab_no)
plot_no
When we compare the graphs, we see that the multiply imputed datasets shrink estimates towards zero.
Multiple imputation is sometimes avoided because people don’t like to “invent” data. However, creating multiply imputed datasets and integrating over their uncertainty during modelling tends to increase, not decrease, the uncertainty in a model. That’s generally a good thing when we want to predict features of the population.
library(patchwork)
# compare models graphically
plot_impute / plot_no +
plot_annotation(title = "Comparison of regressions using (a) multiple-imputed and (b) row-wise deleted datasets", tag_levels = 'a')
However, it would be a mistake to think that multiple imputation is sufficient. The missingness might not be at random conditional on the variables in your model. In that case, multiple imputation cannot protect your inferences from bias.
You’ve come very far in only twelve weeks. Among other things, you have acquired the following skills:
Statisticians call the assumption of conditional randomness in missingness “MAR: Missing At Random.” The language is admittedly confusing – what statisticians mean is “missing conditional on known predictors.”↩︎
Although it generally does no harm to include collinear indicators if our task is prediction.↩︎